Simulation study

Introduction

I want to create a simulation study to do two things:

  1. absolute evaluation of consensus inference as a inferential technique for mixture models; and
  2. comparative evluation of consensus, Bayesian and frequentist inference of mixture models.

I am interested in evaluating:

  • Performance - how well each method predicts the true clustering (metric: Adjusted Rand Index);
  • Robustness - how well each method explores the model space (metric: NOT KNOWN); and
  • Speed - how long each method takes to run (metric: seconds).

Data generating mechanism

I refer to the objects being clustered as items and the variables being used to cluster them as features. I use the language of Law, Jain, and Figueiredo (2003) and refer to relevant and irrelevant features referring to features that provide component specific information and thus contribute to the clustering and those that do not.

  • \(N\): the number of items generated;
  • \(P\): the number of relevant features used;
  • \(P_n\): the number of irrelevant features used;
  • \(K\): the true number of clusters present;
  • \(X = (x_1, \ldots, x_N)\): the items generated;
  • \(\pi=(\pi_1, \ldots, \pi_K)\): the expected proportions of the population belonging to each cluster;
  • \(z=(z_1, \ldots, z_n)\): the allocation variable for each item;
  • \(\theta=(\theta_1, \ldots, \theta_K)\): the parameters associated with each component; and
  • \(\phi=(\phi_1,\ldots, \phi_P)\): the indicator variable of feature relevance.

Our data generating model is a finite mixture model with \(P_n\) irrelevant features and independent features:

\[ \begin{aligned} p(x, z, \theta, \pi) &= \prod_{i=1}^N p(x_i | z_i, \theta_{z_i}) \prod_{i=1}^N p (z_i | \pi) p(\pi) p(\theta) \\ &= \prod_{i=1}^N \prod_{p=1}^P p(x_{ip} | z_i, \theta_{z_ip})^{(1 - \phi_p)} p(x_{ip} | \theta_p) ^ {\phi_p} \prod_{i=1}^N p (z_i | \pi) p(\pi) p(\theta) \end{aligned} \]

During my simulations I assume that the model in question is a mixture of Gaussians and thus \(\theta_{kp}=(\mu_{kp}, \sigma^2_{kp})\).

As each method of inference uses a common model, this data-generating mechanism is not expected to favour any one method over another.

I test seven different scenarios that change various parameters in this model. I will define the component parameters by \(\Delta_{\mu}\), the distance between the possible means in each feature, and \(\sigma^2\), a common standard deviation across all components and features. The scenarios tested are:

  1. The 2D Gaussian case (this is a sense-check);
  2. The lack-of-structure case in 2 dimensions;
  3. The base case for which scenarios 4-6 are variations;
  4. Increasing \(\sigma^2\);
  5. Increasing the number of irrelevant features;
  6. Varying the expected proportion of the total population assigned to each sub-population;
  7. The large \(N\), small \(P\) paradigm; and
  8. The small \(N\), large \(P\) case.

A more detailed description of each scenario and various sub-scenarios is given in the below table.

Scenario N P_s P_n K Delta_mu sigma2 Pi
1 Simple 2D 100 2 0 5 3.0 1 vec(1/K)
2 No structure 100 0 2 1 0.0 1 vec(1/K)
3 Base Case 200 20 0 5 1.0 1 vec(1/K)
4 Large standard deviation 200 20 0 5 1.0 3 vec(1/K)
5 Large standard deviation 200 20 0 5 1.0 5 vec(1/K)
6 Irrelevant features 200 20 10 5 1.0 1 vec(1/K)
7 Irrelevant features 200 20 20 5 1.0 1 vec(1/K)
8 Irrelevant features 200 20 100 5 1.0 1 vec(1/K)
9 Varying proportions 200 20 0 5 1.0 1 (0.5, 0.25, 0.125, 0.0675, 0.0675)
10 Large N, small P 10000 4 0 5 1.0 1 vec(1/K)
11 Large N, small P 10000 4 0 50 1.0 1 vec(1/K)
12 Large N, small P 10000 4 0 50 0.5 1 vec(1/K)
13 Small N, large P 50 500 0 5 1.0 1 vec(1/K)
14 Small N, large P 50 500 0 5 0.2 1 vec(1/K)

Measures

In this study we will be considering both within-simulation and across-simulaiton performance for the \(N_{sim}\) simulations run under each scenario.

The aim of each model is uncovering the true clustering of the data, \(C'\). The performance of each model may be judged by compaing the predicted clustering, \(C^*\), to \(C'\) using the adjusted rand index (\({ARI(C^*, C')}\)).

Robustness may be judged by the uncertainty on this estimate, but also across-model agreement (for Bayesian and Frequentist inference, but not Consensus inference). This needs thought. If Bayesian / frequentist have disagreement on predicted clustering within a simulation across model starts, or common clustering and different uncertainty? Measure Byaeisan convergence using gelman-rubin (across-chains) and autocorrelation (within-chain).

Time may be measured in seconds - relative time taken in each simulation? e.g. \(\min(t_{freq_l}) = t_{0l}\), \(\Delta t_{ml} = t_{ml} - t_{0l}\) for the lth sim. Boxplot.

Layout

Below is some pseudo-code describing my current plan for the simulations.

Question: do I need to save the datasets? If I set a seed we can re-create them, and I’m not sure why we would need them.

# Inputs:
# Scenarios:  Scenarios defining datasets within each simulation
# nSim:       The number of simulations within each scenario
# nChains:    The number of chains used for CI       
# nIter:      The numbeer of iterations within each chain
# nBayes:     The numbeer of chains run for the Bayesian method
# bayesIter:  The number of iterations to run for each Bayesian model chain
# nFreq:      The numbeer of different seeds used for the Frequentist method
runSimulations(Scenarios, nSim, nChains, nIter, nBayes, nFreq){
  
  # Loop over scenarios
  for scn in Scenarios{
  
    # The various parameters that define the current scenario
    (N, P, K, delta_mu, sigma2, pi, P_n) = scn$Parameters
    
    # Objects to hold model performance score
    ciScore             = array(0, dim = c(nIter, nChains, nSim))
    bayesScore          = matrix(0, nrow = nBayes, ncol = nSim)
    freqScore           = matrix(0, nrow = nFreq, ncol = nSim)
    
    # Matrices to hold the upper and lower bounds on the number of clusters 
    # present based upon samples recorded
    nClustLower = array(0, dim = c(nIter, nChains, nSim))
    nClustUpper = array(0, dim = c(nIter, nChains, nSim))
    bayesNClustLower = matrix(0, nrow = nBayes, ncol = nSim)
    bayesNClustUpper = matrix(0, nrow = nBayes, ncol = nSim)

    # Loop over the number of simulations
    for l in 1:nSim{
      
      # Generate a new dataset for the current simulation based upon the scenario
      set.seed(l)
      dataset, trueClusters = generateSimulationDataset(N, P, K, delta_mu, sigma2, pi, P_n)
      
      # Create an array to hold the predicted clusterings
      predClustering      = array(0, dim = c(nIter, N, nChains))
      bayesClustering     = matrix(0, nrow = nBayes, ncol = N)
      freqClustering      = matrix(0, nrow = nFreq, ncol = N)
      
      # Array to hold consensus matrices for different number of iterations
      consensusMatrices   = array(0, dim = (N, N, nIter))
      
      # List to hold the results of each Bayes model
      bayesModel          = list()
      
      # Loop over the number of chains
      for i in 1:nChains{
        
        # set the seed to differentiate chains
        set.seed(i)
        
        # run nBayes Bayesian models
        if(i <= nBayes){
          bayesModel[[i]] = bayesianMixtureModel(dataset, bayesIter)
          ciModel = bayesModel[1:nIter, ]
          
          # Create the PSM
          posteriorSimilarityMatrix = makePSM(bayesModel$clusterings)
          
          # Create a summary clustering
          bayesClustering[i, ] = clustering(posteriorSimilarityMatrix)
          
          # Uncertainty on the number of clusters present based upon MCMC samples
          # (see Sara Wade and Zoubin Ghahramani, 2015).
          bayesCredibleBall = credibleball(
            bayesClustering[i, ],
            bayesModel$clusterings
          )
          
          # Record the score of the current Bayesian model
          bayesScore[i, l] = ARI(bayesClustering[i, ], trueClusters) 
            
          bayesPredK = numberClusters(bayesClustering[i, ])
          
          # Record the the uncertainty on the number of clusters present
          bayesNClustLower[i, l] = bayesCredibleBall$lower - bayesPredK
          bayesNClustUpper[i, l] = bayesCredibleBall$upper - bayesPredK
      
        } else {
          ciModel = bayesianMixtureModel(dataset, nIter)
        }
        
        # Initialise the consensus matrix
        consensusMatrix = matrix(0, nrow = N, ncol = N)
        for j in 1:nIter{
          
   
          # The clusterings for the current chain and given 
          ciClusterings[j, , i] = ciModel$clusterings[j,]
          
          # Build a consensus matrix for the current results
          currConsensusMatrix = makeConsensusMatrix(ciModel$clusterings[j,])
          
          # Update the consensus matrix pertaining to the number of chains
          consensusMatrices[, , j] = currConsensusMatrix * (1 / i) + consensusMatrix[, , j] * ((i - 1) / i)
          
          # Need some way of doing this - Paul doesn't agree with use of 
          # ``mcclust::maxpear``, I have to think about this
          predClustering[j, , i] = clustering(consensusMatrices[, , j])
          
          # calculate how well the model has performed in uncovering the true 
          # structure
          ciScore[j, i, l] = ARI(predClustering[j, , i], trueClusters)
    
          # The predicted number of cluters present
          ciPredK = numberClusters(predClustering[j, , i])
          
          # Not sure that this is appropriate either for the same reasons as 
          # ``mcclust::maxpear`` might not be; I have to look into this properly.
          credibleBallClusters = credibleball(
            predClustering[j, , i],
            ciClusterings[j, ,]
          )
          
          # Record the lower and upper bounds on the number of clusters present
          nClustLower[j, i, l] = credibleBallClusters$lower - ciPredK
          nClustUpper[j, i, l] = credibleBallClusters$upper - ciPredK
          
        }

        # run nFreq frequentist models and record the predicted clustering
        if(i <= nFreq){
          freqModel = frequentistMixtureModel(dataset)
          freqclustering[i, ] = freqModel$clustering
          freqScore[i, l] = ARI(freqModel$clustering, trueClusters)
        }
      
      }

      medCIScore = apply(ciScore[, , l], )
      
      # Check if the different Bayesian models are converged (some test based
      # upon Gelman-Rubin and auto-correlation?)
      # see: coda::gelman.diag(), coda::autocorr.diag()
      bayesConverged = checkConvergence(bayesModel) 
      
    }

    # Make a 3D surface plot of the median score of the consensus inference 
    # results as a function of (nIter, nChains)
    plot3D(ciScore)
    
    # Make line plots of the median score bounded by the interquartile range over 
    # simulations for a subset of number of chains as a funciton of number of 
    # iterations and for a subset of number of iterations as a funciton of 
    # number of chains.
    subsetChains  = c(1, 10, 50, 100)
    subsetIter    = c(1, 100, 500, 5000)
    nChainSubsets = length(subsetChains)
    nIterSubsets  = length(subsetIter)
    
    facetLinePlot(ciScore, ~nChains:nIter, subsetChains, subsetIter)
    
    # The scores for the subset of interest
    ciSubsetScores = ciScore[subsetIter, subsetChains, ]

    ###Not sure about comparing yet - 
    ###Also, add test (coda::gelman.diag()) for convergence in Bayesian case
    # # Compare the median score of the models over simulations:
    # for(i in 1:nChainSubsets){
    #   medCIScoreChain = ciSubsetScores[ , i, ] %>%
    #     apply(1, median)
    # }
    # 
    # for(j in 1:nIterSubsets){
    #   medCiScoreIter = ciSubsetScores[j, , ] %>%
    #     apply(1, median)
    # }
    # 
    # medBayesScore = bayesScore %>%
    #   apply(1, median)
    # 
    # medFreqScore = freqScore %>%
    #   apply(1, median)
    #   
    # # Compare median score over simulations
    # boxplot(medCIScoreIter, medCIScoreIter, medBayesScore, medFreqScore)
    
    
  }

}

Example

My idea.

K <- 5
n <- 100
n_sim <- 10
n_seeds <- 50
n_iter <- 500

score_ci <- array(dim = c(n_sim, n_seeds, n_iter))

true_labels <- matrix(sample(1:K, size = n_sim * n, replace = T), nrow = n_sim)

for(l in 1:n_sim){
true_labels_l <-true_labels[l, ]


ci_out <- list()
ci_cm <- list()
pred_ci <- c()

pred_ci <- matrix(nrow = n_iter, ncol = n)

score_ci_l <- matrix(nrow = n_iter, ncol = n_seeds)

for(j in 1:n_iter){

  .old_cm <- matrix(0, nrow=n, ncol=n)
  ci_out_j <- matrix(0, nrow = n_seeds, ncol = n)

  for(i in 1:n_seeds){
    set.seed(i)
    ci_out_j[i, ] <- ciSim(true_labels_l, j, K, truth_at = 1500)
    
    # Create coclustering matrix (technically cltoSim creates a posterior similarity
    # matrix from a clustering vector, but the effect in *this* case is the same)
    .curr_cm_i <- mcclust::cltoSim(ci_out_j[i, ])
    
    
    # .curr_cm_i <- makePSM(ci_out_j[i, ])
    
    .curr_cm <- .curr_cm_i * (1/i) + .old_cm * ((i - 1) / i)
    
    score_ci[l, i, j] <- .curr_cm %>%
      mcclust::maxpear() %>%
      .$cl %>%
      mcclust::arandi(true_labels_l)
    
    .old_cm <- .curr_cm

  }

  ci_out[[j]] <- ci_out_j
  ci_cm[[j]] <- ci_cm_j <- .curr_cm # makePSM(ci_out_j)
  pred_ci[j,] <- ci_cl_j <- mcclust::maxpear(ci_cm_j)$cl
  # score_ci[j] <- mcclust::arandi(ci_cl_j, true_labels)
}

# score_ci[[l]] <- score_ci_l

}

Visualisations of results.

probs <- seq(0., 1, 0.25)

for(i in 1:n_iter){
  if(i == 1){
    surfaces <- apply(score_ci[, , i], 2, quantile,  probs = probs)
  } else {
    surfaces <- cbind(surfaces, apply(score_ci[, , i], 2, quantile,  probs = probs))
  }
}
surfaces[,1:4]
##            [,1]       [,2]       [,3]       [,4]
## 0%   0.02079530 0.02092104 0.01332191 0.02334382
## 25%  0.03887161 0.03685330 0.02236486 0.04215909
## 50%  0.04207585 0.04565432 0.03189869 0.04843739
## 75%  0.06738553 0.06763436 0.04727810 0.05226161
## 100% 0.09367318 0.09367318 0.07071668 0.13768466
z1 <- surfaces[1, ] %>% matrix(nrow = n_iter, ncol = n_seeds, byrow = T)
z2 <- surfaces[2, ] %>% matrix(nrow = n_iter, ncol = n_seeds, byrow = T)
z3 <- surfaces[3, ] %>% matrix(nrow = n_iter, ncol = n_seeds, byrow = T)
z4 <- surfaces[4, ] %>% matrix(nrow = n_iter, ncol = n_seeds, byrow = T)
z5 <- surfaces[5, ] %>% matrix(nrow = n_iter, ncol = n_seeds, byrow = T)

# col_pal <- colorRampPalette(c("#146EB4", "white", "#FF9900"))(100)

plot_ly(colors = col_pal) %>%
  add_surface(x = ~1:n_seeds, y = ~1:n_iter, z = ~z2, coloraxis = 'coloraxis', opacity = 0.6) %>%
  add_surface(x = ~1:n_seeds, y = ~1:n_iter, z = ~z3, coloraxis = 'coloraxis') %>%
  add_surface(x = ~1:n_seeds, y = ~1:n_iter, z = ~z4, coloraxis = 'coloraxis', opacity = 0.6) %>%
  layout(
    scene = list(
      xaxis = list(title = "Number of chains"),
      yaxis = list(title = "Number of iterations"),
      zaxis = list(title = "ARI predicted clustering to truth")
    ),
    coloraxis= list(colorscale='Viridis') #,
    # surfacecolor = list(col_pal),
  ) %>%
  colorbar(title = "ARI")

Facet line plot

plt_z2 <- surface2Df(z2, n_seeds, n_iter)
## Note: Using an external vector in selections is ambiguous.
## i Use `all_of(n_seeds)` instead of `n_seeds` to silence this message.
## i See <https://tidyselect.r-lib.org/reference/faq-external-vector.html>.
## This message is displayed once per session.
plt_z3 <- surface2Df(z3, n_seeds, n_iter)
plt_z4 <- surface2Df(z4, n_seeds, n_iter)

plt_data <- dplyr::bind_cols(
  plt_z3, 
  plt_z2 %>% dplyr::select(ARI),
  plt_z4 %>% dplyr::select(ARI)
  ) %>% 
  set_colnames(c("N_iter", "N_chains", "ARI", "ARI_lb", "ARI_ub"))

chains_used <- c(1, 5, 15, 30)
iter_used <- c(1, 10, 50, 100)
N_chain_labs <- c(paste0("Number of chains ", chains_used)) %>% 
  set_names(chains_used)
# names(cluster_labs) <- 1:5

# New facet label names for cluster variable
chain_labels <- c(paste0("Number of chains: ", chains_used))
names(chain_labels) <- chains_used


# Create the plot
p1 <- ggplot(
  plt_data %>% filter(N_chains %in% chains_used),
  aes(x = as.numeric(N_iter), y = ARI, group = N_chains)
  ) +
  geom_line()+
  geom_ribbon(aes(ymin = ARI_lb, ymax = ARI_ub, group = N_chains), colour = NA, alpha =0.3) +
  facet_wrap(~N_chains, ncol =1, labeller = labeller(N_chains = chain_labels))+
  labs(
    x = "Number of iterations" #,
    # subtitle = "Number of chains",
    # subtitle = "Median score and inter-quartile range of predicitons against truth",
    # title = "Performance of Consensus inference across simulations"
  )


# New facet label names for cluster variable
iter_labels <- c(paste0("Number of iterations: ", iter_used))
names(iter_labels) <- iter_used

p2 <- ggplot(
  plt_data %>% filter(N_iter %in% iter_used),
  aes(x = as.numeric(N_chains), y = ARI, group = N_iter)
) +
  geom_line()+
  geom_ribbon(aes(ymin = ARI_lb, ymax = ARI_ub, group = N_iter), colour = NA, alpha =0.3) +
  facet_wrap(~N_iter, ncol =1, labeller = labeller(N_iter = iter_labels)) +
  labs(
    x = "Number of chains" # ,
    # subtitle = "Number of iterations"
  )


p1 + p2 +
  plot_annotation(
    title = 'Performance of Consensus inference across simulations',
    subtitle = "Median score and inter-quartile range of ARI (predictions against truth)"
  ) 

Model validation

I was thinking (for the real data) of plotting PCA series (is there a nice term for a series over an ordered variable?). Example:

# Parameters defining generated data
K <- 5
n <- 100
p <- 20

# Generate a dataset
my_data <- generateSimulationDataset(K = K, p = p, n = n)

# PCA
pc1 <- prcomp(my_data$data)

# Include a ribbon about the inter-quantile range
p3 <- suppressWarnings(pcaSeriesPlot(pc1$x,
  labels = my_data$cluster_IDs,
  include_area = T,
  n_comp = 6
))

# New facet label names for cluster variable
cluster_labs <- c(paste0("Cluster ", 1:5))
names(cluster_labs) <- 1:5

# Create the plot
p3 +
  facet_wrap(~Cluster, labeller = labeller(Cluster = cluster_labs)) +
  labs(title = "PCA plot including medians and inter-quartile range") + 
  theme(legend.position="none")
## Warning: Removed 1400 rows containing missing values (geom_point).
## Warning: Removed 1400 row(s) containing missing values (geom_path).
## Warning: Removed 70 row(s) containing missing values (geom_path).

This is a way of visualising the different clusters and the signatures that define them; I think the visualisation of the median this way makes it easier to see that all of the clusters do separate out sensibly even if plotting it as a process across components might be misleading.

Half-baked thoughts

Thinking about possible extensions (not for now):

  1. randomising the order of rows (as collapsed Gibbs sampler, probably not necessary with random initialisations);
  2. sampling items clustered (test robustness of clusters formed); and
  3. sampling features used (moving towards multi-view clustering). This would help find the different features contributing, and would involve some more thought about how to combine different chains (possibly chains with no structure found, i.e. singletons or all in one cluster, might be dismissed and if there are many chains with this look at ignoring some of the features used; this moves towards different problems such as feature selection and does not need work for now as it complicates the problem).

Also, would combining 2. and 3. go some way to making this a biclustering method? As some items would cluster together under some features but not others we would have some information for clustering items and features.

References

Law, Martin H, Anil K Jain, and Mário Figueiredo. 2003. “Feature Selection in Mixture-Based Clustering.” In Advances in Neural Information Processing Systems, 641–48.